The original data comes from the Cleveland database in the UCI Machine Learning Repository and can also be found on Kaggle.
What goal do we want to achieve?
If, during the proof of concept, we can reach 95% accuracy at predicting whether or not a patient has heart disease, we will continue the project.
Below is an overview of the features contained in the dataset.
# Import all the tools we need
# Regular EDA (exploratory data analysis) and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# we want our plots to appear inside the notebook
%matplotlib inline
# Models from Scikit-Learn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
# Model Evaluations
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
# this will error in Scikit-Learn version 1.2+
# from sklearn.metrics import plot_roc_curve
# Available in Scikit-Learn version 1.2+
from sklearn.metrics import RocCurveDisplay
df = pd.read_csv("heart-disease.csv")
df.shape # (rows, columns)
(303, 14)
The goal here is to learn more about the data.
df.head()
| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
# The `target` column contains the value we want to predict.
# How many of each class are present? There must be enough data of both classes to train the model.
df["target"].value_counts()
target
1    165
0    138
Name: count, dtype: int64
# Graphical representation
df["target"].value_counts().plot(kind="bar", color=["salmon", "lightblue"]);
# Information about the dataset
# Models cannot handle text, only numbers. That is why it is important
# to convert the data to numerical values.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       303 non-null    int64
 1   sex       303 non-null    int64
 2   cp        303 non-null    int64
 3   trestbps  303 non-null    int64
 4   chol      303 non-null    int64
 5   fbs       303 non-null    int64
 6   restecg   303 non-null    int64
 7   thalach   303 non-null    int64
 8   exang     303 non-null    int64
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64
 11  ca        303 non-null    int64
 12  thal      303 non-null    int64
 13  target    303 non-null    int64
dtypes: float64(1), int64(13)
memory usage: 33.3 KB
# Empty fields are disastrous for a model.
# Check whether everything is filled in; if not, the whole row has to be removed from the dataset!
df.isna().sum()
age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64
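This dataset turns out to be complete, but if rows did contain empty fields, removing them could look like this (a minimal sketch on a toy frame, not the heart-disease data):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value, standing in for a dataset that has NaNs
toy = pd.DataFrame({"age": [63, 37, np.nan], "chol": [233, 250, 204]})

# Drop every row that contains at least one empty field
toy_clean = toy.dropna()
print(toy_clean.shape)  # (2, 2)
```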
# It is also important that the values are not too far apart.
# If a column goes above 10,000 or higher, this can negatively affect the model; the data
# would then have to be rescaled, e.g. house prices from 653,000 to 6.53.
df.describe()
| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 |
| mean | 54.366337 | 0.683168 | 0.966997 | 131.623762 | 246.264026 | 0.148515 | 0.528053 | 149.646865 | 0.326733 | 1.039604 | 1.399340 | 0.729373 | 2.313531 | 0.544554 |
| std | 9.082101 | 0.466011 | 1.032052 | 17.538143 | 51.830751 | 0.356198 | 0.525860 | 22.905161 | 0.469794 | 1.161075 | 0.616226 | 1.022606 | 0.612277 | 0.498835 |
| min | 29.000000 | 0.000000 | 0.000000 | 94.000000 | 126.000000 | 0.000000 | 0.000000 | 71.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 47.500000 | 0.000000 | 0.000000 | 120.000000 | 211.000000 | 0.000000 | 0.000000 | 133.500000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 2.000000 | 0.000000 |
| 50% | 55.000000 | 1.000000 | 1.000000 | 130.000000 | 240.000000 | 0.000000 | 1.000000 | 153.000000 | 0.000000 | 0.800000 | 1.000000 | 0.000000 | 2.000000 | 1.000000 |
| 75% | 61.000000 | 1.000000 | 2.000000 | 140.000000 | 274.500000 | 0.000000 | 1.000000 | 166.000000 | 1.000000 | 1.600000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 |
| max | 77.000000 | 1.000000 | 3.000000 | 200.000000 | 564.000000 | 1.000000 | 2.000000 | 202.000000 | 1.000000 | 6.200000 | 2.000000 | 4.000000 | 3.000000 | 1.000000 |
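None of the columns here come close to that range, but as a sketch of how such rescaling could be done with scikit-learn's `StandardScaler` (not applied in this notebook):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical house prices on a large scale, as in the example above
prices = np.array([[653_000.0], [420_000.0], [815_000.0]])

scaler = StandardScaler()             # rescales each column to mean 0, std 1
scaled = scaler.fit_transform(prices)
print(scaled.round(2).ravel())
```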
# To get a better picture of the data and check that we understand it. We obviously also need to be able to sanity-check the model.
# Can we show whether sex has an influence on heart disease?
df.sex.value_counts()
sex
1    207
0     96
Name: count, dtype: int64
# Now as a table:
pd.crosstab(df.target, df.sex)
| target | sex=0 | sex=1 |
|---|---|---|
| 0 | 24 | 114 |
| 1 | 72 | 93 |
# Frequency per sex as a bar chart
pd.crosstab(df.target, df.sex).plot(kind="bar",
figsize=(10, 6),
color=["salmon", "lightblue"])
plt.title("Heart Disease Frequency for Sex")
plt.xlabel("0 = No Disease, 1 = Disease")
plt.ylabel("Amount")
plt.legend(["Female", "Male"]);
plt.xticks(rotation=0);
# Age versus maximum heart rate
plt.figure(figsize=(10, 6))
# Scatter of the positive cases
plt.scatter(df.age[df.target==1],
df.thalach[df.target==1],
c="salmon")
# Scatter of the negative cases
plt.scatter(df.age[df.target==0],
df.thalach[df.target==0],
c="lightblue")
# Add metadata
plt.title("Heart Disease in function of Age and Max Heart Rate")
plt.xlabel("Age")
plt.ylabel("Max Heart Rate")
plt.legend(["Disease", "No Disease"]);
# How is age distributed? A quick histogram:
df.age.plot.hist();
# Preparing the data for machine learning:
df.head()
| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
# We want to predict the 'target' column, so it must not appear in X
# Split data into X and y
X = df.drop("target", axis=1)
y = df["target"]
X
| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 298 | 57 | 0 | 0 | 140 | 241 | 0 | 1 | 123 | 1 | 0.2 | 1 | 0 | 3 |
| 299 | 45 | 1 | 3 | 110 | 264 | 0 | 1 | 132 | 0 | 1.2 | 1 | 0 | 3 |
| 300 | 68 | 1 | 0 | 144 | 193 | 1 | 1 | 141 | 0 | 3.4 | 1 | 2 | 3 |
| 301 | 57 | 1 | 0 | 130 | 131 | 0 | 1 | 115 | 1 | 1.2 | 1 | 1 | 3 |
| 302 | 57 | 0 | 1 | 130 | 236 | 0 | 0 | 174 | 0 | 0.0 | 1 | 1 | 2 |
303 rows × 13 columns
# y contains the following values:
y
0 1
1 1
2 1
3 1
4 1
..
298 0
299 0
300 0
301 0
302 0
Name: target, Length: 303, dtype: int64
# Split data into train and test sets
np.random.seed(42)
# Split into train & test set
X_train, X_test, y_train, y_test = train_test_split(X,
y,
test_size=0.2)
X_train
| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 132 | 42 | 1 | 1 | 120 | 295 | 0 | 1 | 162 | 0 | 0.0 | 2 | 0 | 2 |
| 202 | 58 | 1 | 0 | 150 | 270 | 0 | 0 | 111 | 1 | 0.8 | 2 | 0 | 3 |
| 196 | 46 | 1 | 2 | 150 | 231 | 0 | 1 | 147 | 0 | 3.6 | 1 | 0 | 2 |
| 75 | 55 | 0 | 1 | 135 | 250 | 0 | 0 | 161 | 0 | 1.4 | 1 | 0 | 2 |
| 176 | 60 | 1 | 0 | 117 | 230 | 1 | 1 | 160 | 1 | 1.4 | 2 | 2 | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 188 | 50 | 1 | 2 | 140 | 233 | 0 | 1 | 163 | 0 | 0.6 | 1 | 1 | 3 |
| 71 | 51 | 1 | 2 | 94 | 227 | 0 | 1 | 154 | 1 | 0.0 | 2 | 1 | 3 |
| 106 | 69 | 1 | 3 | 160 | 234 | 1 | 0 | 131 | 0 | 0.1 | 1 | 1 | 2 |
| 270 | 46 | 1 | 0 | 120 | 249 | 0 | 0 | 144 | 0 | 0.8 | 2 | 0 | 3 |
| 102 | 63 | 0 | 1 | 140 | 195 | 0 | 1 | 179 | 0 | 0.0 | 2 | 2 | 2 |
242 rows × 13 columns
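The split above is reproducible only because `np.random.seed(42)` sets NumPy's global state just before the call. An alternative sketch (not what this notebook ran) passes `random_state` and `stratify` to `train_test_split` directly, shown here on toy stand-in data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for X and y
X_demo = pd.DataFrame({"feature": range(10)})
y_demo = pd.Series([0, 1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,
    random_state=42,  # reproducible without relying on NumPy's global state
    stratify=y_demo)  # keep the 0/1 ratio identical in train and test
print(len(y_tr), int(y_tr.sum()))  # 8 training samples, 4 of them class 1
```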
We are going to test how the following 3 models perform:
# Put models in a dictionary
models = {"Logistic Regression": LogisticRegression(n_jobs=-1),
"KNN": KNeighborsClassifier(n_jobs=-1),
"Random Forest": RandomForestClassifier(n_jobs=-1)}
# Create a function to fit and score models
def fit_and_score(models, X_train, X_test, y_train, y_test):
"""
Fits and evaluates given machine learning models.
models : a dict of different Scikit-Learn machine learning models
X_train : training data (no labels)
X_test : testing data (no labels)
y_train : training labels
y_test : test labels
"""
# Set random seed
np.random.seed(42)
# Make a dictionary to keep model scores
model_scores = {}
# Loop through models
for name, model in models.items():
# Fit the model to the data
model.fit(X_train, y_train)
# Evaluate the model and append its score to model_scores
model_scores[name] = model.score(X_test, y_test)
return model_scores
model_scores = fit_and_score(models=models,
X_train=X_train,
X_test=X_test,
y_train=y_train,
y_test=y_test)
model_scores
{'Logistic Regression': 0.8852459016393442,
'KNN': 0.6885245901639344,
'Random Forest': 0.8360655737704918}
model_compare = pd.DataFrame(model_scores, index=["accuracy"])
model_compare.T.plot.bar();
Now that we have a baseline model, we will investigate where we can still improve.
Possible options:
Hyperparameters are parameters that can be set separately per model (see n_jobs=-1).
# Let's tune KNN
train_scores = []
test_scores = []
# Create a list of different values for n_neighbors
neighbors = range(1, 21)
# Setup KNN instance
knn = KNeighborsClassifier()
# Loop through different n_neighbors
for i in neighbors:
knn.set_params(n_neighbors=i)
# Fit the algorithm
knn.fit(X_train, y_train)
# Update the training scores list
train_scores.append(knn.score(X_train, y_train))
# Update the test scores list
test_scores.append(knn.score(X_test, y_test))
train_scores
[1.0, 0.8099173553719008, 0.7727272727272727, 0.743801652892562, 0.7603305785123967, 0.7520661157024794, 0.743801652892562, 0.7231404958677686, 0.71900826446281, 0.6942148760330579, 0.7272727272727273, 0.6983471074380165, 0.6900826446280992, 0.6942148760330579, 0.6859504132231405, 0.6735537190082644, 0.6859504132231405, 0.6652892561983471, 0.6818181818181818, 0.6694214876033058]
test_scores
[0.6229508196721312, 0.639344262295082, 0.6557377049180327, 0.6721311475409836, 0.6885245901639344, 0.7213114754098361, 0.7049180327868853, 0.6885245901639344, 0.6885245901639344, 0.7049180327868853, 0.7540983606557377, 0.7377049180327869, 0.7377049180327869, 0.7377049180327869, 0.6885245901639344, 0.7213114754098361, 0.6885245901639344, 0.6885245901639344, 0.7049180327868853, 0.6557377049180327]
# A plot would be nice here:
plt.plot(neighbors, train_scores, label="Train score")
plt.plot(neighbors, test_scores, label="Test score")
plt.xticks(np.arange(1, 21, 1))
plt.xlabel("Number of neighbors")
plt.ylabel("Model score")
plt.legend()
print(f"Maximum KNN score on the test data: {max(test_scores)*100:.2f}%")
Maximum KNN score on the test data: 75.41%
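In the run above the maximum, 75.41%, occurs at n_neighbors=11 (the 11th entry of test_scores). The best value can also be read off programmatically with `np.argmax`; a small sketch on stand-in scores:

```python
import numpy as np

# Hypothetical scores standing in for the test_scores list above
neighbors = list(range(1, 5))
scores = [0.62, 0.64, 0.75, 0.72]

best_idx = int(np.argmax(scores))  # index of the highest score
print(f"Best n_neighbors: {neighbors[best_idx]} (score {scores[best_idx]:.2%})")
```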
Adjusting every parameter by hand and rerunning the model would cost a lot of time. Fortunately there is a function for this :)
For the next sections we will automate this with RandomizedSearchCV.
# Create a hyperparameter grid for LogisticRegression
log_reg_grid = {"C": np.logspace(-4, 4, 20),
"solver": ["liblinear"]}
# Create a hyperparameter grid for RandomForestClassifier
rf_grid = {"n_estimators": np.arange(10, 1000, 50),
"max_depth": [None, 3, 5, 10],
"min_samples_split": np.arange(2, 20, 2),
"min_samples_leaf": np.arange(1, 20, 2)}
# Tune LogisticRegression
np.random.seed(42)
# Setup random hyperparameter search for LogisticRegression
rs_log_reg = RandomizedSearchCV(LogisticRegression(),
param_distributions=log_reg_grid,
cv=5,
n_iter=20,
n_jobs=-1,
verbose=True)
# Fit random hyperparameter search model for LogisticRegression
rs_log_reg.fit(X_train, y_train)
Fitting 5 folds for each of 20 candidates, totalling 100 fits
RandomizedSearchCV(cv=5, estimator=LogisticRegression(), n_iter=20, n_jobs=-1,
param_distributions={'C': array([1.00000000e-04, 2.63665090e-04, 6.95192796e-04, 1.83298071e-03,
4.83293024e-03, 1.27427499e-02, 3.35981829e-02, 8.85866790e-02,
2.33572147e-01, 6.15848211e-01, 1.62377674e+00, 4.28133240e+00,
1.12883789e+01, 2.97635144e+01, 7.84759970e+01, 2.06913808e+02,
5.45559478e+02, 1.43844989e+03, 3.79269019e+03, 1.00000000e+04]),
'solver': ['liblinear']},
verbose=True)
# There is also a built-in attribute that returns the hyperparameters with the highest score
rs_log_reg.best_params_
{'solver': 'liblinear', 'C': 0.23357214690901212}
# Score on the test data
rs_log_reg.score(X_test, y_test)
0.8852459016393442
Now we have the best hyperparameters for LogisticRegression(); next up, the RandomForestClassifier()...
# Setup random seed
np.random.seed(42)
# Setup random hyperparameter search for RandomForestClassifier
rs_rf = RandomizedSearchCV(RandomForestClassifier(),
param_distributions=rf_grid,
cv=5,
n_iter=20,
n_jobs=-1,
verbose=True)
# Fit random hyperparameter search model for RandomForestClassifier()
rs_rf.fit(X_train, y_train)
Fitting 5 folds for each of 20 candidates, totalling 100 fits
RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(), n_iter=20,
n_jobs=-1,
param_distributions={'max_depth': [None, 3, 5, 10],
'min_samples_leaf': array([ 1, 3, 5, 7, 9, 11, 13, 15, 17, 19]),
'min_samples_split': array([ 2, 4, 6, 8, 10, 12, 14, 16, 18]),
'n_estimators': array([ 10, 60, 110, 160, 210, 260, 310, 360, 410, 460, 510, 560, 610,
660, 710, 760, 810, 860, 910, 960])},
verbose=True)
# The best hyperparameters found for RandomForestClassifier
rs_rf.best_params_
{'n_estimators': 210,
'min_samples_split': 4,
'min_samples_leaf': 19,
'max_depth': 3}
# Evaluate the randomized search for the RandomForestClassifier model
rs_rf.score(X_test, y_test)
0.8688524590163934
Let's see whether we can improve it further with GridSearchCV.
# A different hyperparameter grid for the LogisticRegression model
log_reg_grid = {"C": np.logspace(-4, 4, 30),
"solver": ["liblinear"]}
# Set up grid hyperparameter search for LogisticRegression
gs_log_reg = GridSearchCV(LogisticRegression(),
param_grid=log_reg_grid,
cv=5,
n_jobs=-1,
verbose=True)
# Fit the grid hyperparameter search model
gs_log_reg.fit(X_train, y_train);
Fitting 5 folds for each of 30 candidates, totalling 150 fits
# Check the best hyperparameters
gs_log_reg.best_params_
{'C': 0.20433597178569418, 'solver': 'liblinear'}
# Evaluate the LogisticRegression model
gs_log_reg.score(X_test, y_test)
0.8852459016393442
To evaluate a model, there are a number of standard functions/methods to determine the score of a given model. This can be based on the following:
It would also be good to use cross-validation wherever possible.
To compare a trained model we first have to make predictions.
# Make predictions with the tuned model
y_preds = gs_log_reg.predict(X_test)
y_preds
array([0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0,
0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0], dtype=int64)
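The metric functions imported at the top (`precision_score`, `recall_score`, `f1_score`) can be applied directly to predictions like these; a self-contained sketch on toy labels:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy labels standing in for y_test and y_preds
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]

print(precision_score(y_true, y_pred))     # all 3 predicted 1s are correct -> 1.0
print(recall_score(y_true, y_pred))        # 3 of the 4 actual 1s found -> 0.75
print(round(f1_score(y_true, y_pred), 3))  # harmonic mean -> 0.857
```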
# Plot ROC curve and calculate AUC metric
RocCurveDisplay.from_estimator(gs_log_reg, X_test, y_test)
<sklearn.metrics._plot.roc_curve.RocCurveDisplay at 0x1438e9d4550>
# Confusion matrix
print(confusion_matrix(y_test, y_preds))
[[25  4]
 [ 3 29]]
# Plot the confusion matrix:
sns.set(font_scale=1.5)
def plot_conf_mat(y_test, y_preds):
"""
Plots a nice looking confusion matrix using Seaborn's heatmap()
"""
fig, ax = plt.subplots(figsize=(3, 3))
ax = sns.heatmap(confusion_matrix(y_test, y_preds),
annot=True,
cbar=False)
plt.xlabel("Predicted label")
plt.ylabel("True label")
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plot_conf_mat(y_test, y_preds)
Now that we have the ROC curve, AUC metric and confusion matrix, we can generate the classification report and the cross-validated precision, recall and F1-score.
print(classification_report(y_test, y_preds))
precision recall f1-score support
0 0.89 0.86 0.88 29
1 0.88 0.91 0.89 32
accuracy 0.89 61
macro avg 0.89 0.88 0.88 61
weighted avg 0.89 0.89 0.89 61
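These numbers can be checked by hand against the confusion matrix above: for class 1 we have TP=29, FP=4 and FN=3, which reproduces the 0.88/0.91/0.89 row of the report:

```python
# Class-1 counts read from the confusion matrix [[25 4] [3 29]]:
tp, fp, fn = 29, 4, 3

precision = tp / (tp + fp)  # 29/33
recall = tp / (tp + fn)     # 29/32
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.88 recall=0.91 f1=0.89
```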
We will calculate the accuracy, precision, recall and F1-score of our model by applying cross-validation with cross_val_score().
# Check hyperparameters
gs_log_reg.best_params_
{'C': 0.20433597178569418, 'solver': 'liblinear'}
# Create a new classifier with best parameters
clf = LogisticRegression(C=0.20433597178569418,
solver="liblinear")
# Cross-validated accuracy
cv_acc = cross_val_score(clf,
X,
y,
cv=5,
scoring="accuracy")
cv_acc
array([0.81967213, 0.90163934, 0.8852459 , 0.88333333, 0.75 ])
cv_acc = np.mean(cv_acc)
cv_acc
0.8479781420765027
# Cross-validated precision
cv_precision = cross_val_score(clf,
X,
y,
cv=5,
scoring="precision")
cv_precision=np.mean(cv_precision)
cv_precision
0.8215873015873015
# Cross-validated recall
cv_recall = cross_val_score(clf,
X,
y,
cv=5,
scoring="recall")
cv_recall = np.mean(cv_recall)
cv_recall
0.9272727272727274
# Cross-validated f1-score
cv_f1 = cross_val_score(clf,
X,
y,
cv=5,
scoring="f1")
cv_f1 = np.mean(cv_f1)
cv_f1
0.8705403543192143
# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({"Accuracy": cv_acc,
"Precision": cv_precision,
"Recall": cv_recall,
"F1": cv_f1},
index=[0])
cv_metrics.T.plot.bar(title="Cross-validated classification metrics",
legend=False);
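The four separate cross_val_score calls above can also be collapsed into one `cross_validate` call with multiple scorers; a sketch on a synthetic dataset standing in for X and y (the scorer strings are standard scikit-learn names):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic binary-classification data standing in for X and y
X_demo, y_demo = make_classification(n_samples=200, random_state=42)

results = cross_validate(LogisticRegression(solver="liblinear"),
                         X_demo, y_demo, cv=5,
                         scoring=["accuracy", "precision", "recall", "f1"])

# One test-score array per metric, averaged over the 5 folds
for metric in ["accuracy", "precision", "recall", "f1"]:
    print(metric, round(results[f"test_{metric}"].mean(), 3))
```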
Feature importance is another way of asking: "Which features contribute most to the outcome of the model, and how do they contribute?"
Let's investigate the feature importance of our LogisticRegression model...
# Fit an instance of LogisticRegression
clf = LogisticRegression(C=0.20433597178569418,
solver="liblinear")
clf.fit(X_train, y_train);
# Check coef_
clf.coef_
array([[ 0.00320769, -0.86062047, 0.66001431, -0.01155971, -0.00166496,
0.04017239, 0.31603402, 0.02458922, -0.6047017 , -0.56795457,
0.45085391, -0.63733326, -0.6755509 ]])
df.head()
| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
# Match coefficients of features to columns (use X.columns: 13 features matching the 13 coefficients)
feature_dict = dict(zip(X.columns, list(clf.coef_[0])))
feature_dict
{'age': 0.0032076873709286024,
'sex': -0.8606204735539111,
'cp': 0.6600143086174385,
'trestbps': -0.01155970641957489,
'chol': -0.0016649609500147373,
'fbs': 0.04017238940156104,
'restecg': 0.3160340177157746,
'thalach': 0.02458922261936637,
'exang': -0.6047017032281077,
'oldpeak': -0.567954572983317,
'slope': 0.4508539117301764,
'ca': -0.6373332602422034,
'thal': -0.6755508982355707}
# Visualize feature importance
feature_df = pd.DataFrame(feature_dict, index=[0])
feature_df.T.plot.bar(title="Feature Importance", legend=False);
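Because logistic-regression coefficients are log-odds, exponentiating them turns them into odds ratios, which can be easier to interpret; a sketch using a few coefficients copied (rounded) from clf.coef_ above:

```python
import numpy as np

# Coefficients copied (rounded) from the output of clf.coef_ above
coefs = {"sex": -0.8606, "cp": 0.6600, "slope": 0.4509}

# exp(coef) > 1: feature pushes the odds of target=1 up; < 1: down
odds_ratios = {name: float(np.exp(c)) for name, c in coefs.items()}
for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.2f}")
```

An odds ratio of about 0.42 for sex means that, with the other features held fixed, the model's odds of disease are lower for sex=1; this is consistent with the crosstabs below.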
pd.crosstab(df["sex"], df["target"])
| sex | target=0 | target=1 |
|---|---|---|
| 0 | 24 | 72 |
| 1 | 114 | 93 |
pd.crosstab(df["slope"], df["target"])
| slope | target=0 | target=1 |
|---|---|---|
| 0 | 12 | 9 |
| 1 | 91 | 49 |
| 2 | 35 | 107 |